Abstract

We present a probabilistic method for linking multiple datafiles. This task is not trivial
in the absence of unique identifiers for the individuals recorded. This is a common scenario
when linking census data to coverage measurement surveys for census coverage evaluation,
and in general when multiple record-systems need to be integrated for subsequent analysis. Our
method generalizes the Fellegi-Sunter theory for linking records from two datafiles and its
modern implementations. The goal of multiple record linkage is to classify the record K-tuples
coming from K datafiles according to the different matching patterns. Our method incorporates
the transitivity of agreement in the computation of the data used to model matching
probabilities. We use a mixture model to fit matching probabilities via maximum likelihood
using the EM algorithm. We present a method to decide the record K-tuples' membership to
the subsets of matching patterns and we prove its optimality. We apply our method to the
integration of the three Colombian homicide record systems and perform a simulation study
to explore the performance of the method under measurement error and different scenarios.
The proposed method works well and opens new directions for future research.

Key words and phrases: Bell number; Census undercount; Data linkage; Data matching; 
EM algorithm; Mixture model; Multiple systems estimation; Partially ordered set. 



1 INTRODUCTION 

Record linkage is a widely used technique for identifying records that refer to the same
individual across different datafiles. This task is not trivial when unique identifiers are not
available, and many authors have proposed probabilistic methods to deal with this problem,
building upon the seminal work of Newcombe et al. (1959) and Fellegi and Sunter (1969).
Applications of record linkage include merging post-enumeration surveys and census data for
census coverage evaluation (e.g., Winkler, 1988; Jaro, 1989; Winkler and Thibaudeau, 1991),
linking health-care databases for epidemiological studies (e.g., Bell et al., 1994; Meray et al.,
2007), and adaptive name matching in information integration (Bilenko et al., 2003), among
others.



1.1 Linking Multiple Datafiles 

To perform record linkage involving more than two datafiles, some authors have used record
linkages for each pair of datafiles or other ad hoc procedures (e.g., see Darroch et al., 1993;
Zaslavsky and Wolfgang, 1993; Asher and Fienberg, 2001; Asher et al., 2003; Meray et al.,
2007). Separate pairwise matchings of datafiles do not guarantee the transitivity of the linkage
decisions and thus require resolving discrepancies (Fienberg and Manrique-Vallier, 2009). For
example, let us suppose we link the record of the individual a in a first datafile and the record
of an individual b in a second datafile from a bipartite record linkage (classical record linkage
of two datafiles). Then, from a second bipartite record linkage, we link the record of b to the
record of an individual c in a third datafile. Based on these two linkages we might conclude
that a, b, and c are the same individual. Unfortunately, had we also linked the first and third
files, a and c might not have matched. If a, b, and c truly correspond to the same individual, the
non-match could occur due to measurement error or incomplete record information. On the
other hand, if the records of a, b, and c do not refer to the same individual, we have four 
possibilities: a and b refer to the same individual but c refers to another one, a and c refer 
to the same individual but b refers to another one, b and c refer to the same individual but 
a refers to another one, or all a, b, and c refer to different individuals. By using bipartite 
record linkage for each pair of files we cannot resolve the matching pattern for these three 
records. While there are various ad hoc approaches to resolve the results of multiple bipartite 
matchings, no formal methodology has appeared in the statistical literature (e.g., see the 
recent surveys of Herzog, Scheuren and Winkler 2007, 2010). 



1.2 Census and Record-Systems Coverage Evaluation



Implementation of accurate methods for census coverage evaluation and possibly census
adjustment requires the integration of multiple datafiles. The usual methodology of census
coverage evaluation matches a coverage measurement survey to the census data in order to
estimate population sizes using dual-system estimation (Hogan, 1992, 1993). This procedure
is subject to "correlation bias," which results when responses to the census and survey are
dependent or the joint inclusion probabilities are heterogeneous (Darroch et al., 1993;
Zaslavsky and Wolfgang, 1993; Anderson and Fienberg, 1999). The incorporation of additional
surveys or administrative data into the coverage evaluation process allows for checking on
assumptions regarding independence of lists and homogeneity, and for modeling departures
from them. This in turn requires attention to the problem of multiple record linkage.

Likewise, under-registration is the norm rather than the exception in record-systems of
human rights violations and violent events in general, especially in countries with high levels
of violence. Discrepancies appear whenever there are different record-systems capturing
information about the same event of interest. The diversity of sources provides a useful input
[...] et al., 2010). A clear example of this scenario occurs in Colombia, where there exist three
homicide record-systems which usually differ in the number of recorded casualties. Those
record-systems are maintained by the Colombian Census Bureau (Departamento Administrativo
Nacional de Estadistica - DANE, in Spanish), the Colombian National Police (Policia
Nacional de Colombia), and the Colombian Forensics Institute (Instituto Nacional de Medicina
Legal y Ciencias Forenses). The discrepancies in the numbers recorded by these record-systems
are the result of conceptual and methodological differences among these institutions,
as well as problems of geographical coverage (Restrepo and Aguirre, 2007). Whereas the data
from the National Police and Forensics Institute simply record the information obtained from
their daily activities, the objective of the Colombian Census Bureau is to determine the true
number of deaths occurring in Colombia and its geographical subdivisions (Departamento
Administrativo Nacional de Estadisticas, DANE, 2009). Thus, the coverage evaluation of the
Colombian Census Bureau record-system is important, and its linkage with the other two
sources can lead to improved estimates of the number of homicides.



1.3 Overview of the Article 



We propose a method for the linkage of multiple datafiles, generalizing the theory of Fellegi
and Sunter (1969) and the implementations presented by Winkler (1988) and Jaro (1989),
which still represent the mainstream approach for unsupervised record linkage (see Copas
and Hilton (1990) for a supervised approach). Our method incorporates the transitivity of
agreement in the computation of the data used to model matching probabilities. In Section
2 we generalize the set of record pairs presented by Fellegi and Sunter (1969) to a K-ary
product of the K datafiles to be linked, and we present this K-ary product as the union
of all the possible subsets that contain the possible patterns of agreement of the record
K-tuples. In Section 3 we propose a method to compute comparison data from record K-tuples,
incorporating transitivity, and we present a way to schematize this kind of data through simple
graphs. In order to fit matching probabilities, in Section 4 we generalize the mixture model
used by Winkler (1988) and Jaro (1989), and in Section 5 we present details of the fitting
of this model using the EM algorithm (Dempster, Laird and Rubin, 1977). In Section 6 we
present an optimal method to decide the record K-tuples' membership to the subsets defined
in Section 2. Section 7 contains an application of the proposed methods to the integration
of the three Colombian homicide record-systems and Section 8 describes simulation studies
where we explore the performance of the method under different scenarios.



2 COVERED SUBPOPULATIONS AND RECORD
K-TUPLES



We follow the exposition of Fellegi and Sunter (1969) and suppose some population is recorded
by K datafiles. Let $A_1, A_2, \dots, A_K$ denote the K overlapping subpopulations recorded in
those K datafiles. Now, suppose that for each datafile there exists a different
record-generating process $\alpha_k$, which produces a set of records denoted by

$$\alpha_k(A_k) = \{\alpha_k(a_k);\ a_k \in A_k\}, \quad k = 1, \dots, K,$$

where the member $\alpha_k(a_k)$ represents a vector of information of the member $a_k \in A_k$. This
information could be subject to measurement error or be incomplete. Let us define the K-ary
Cartesian product

$$\bigotimes_{k=1}^{K} \alpha_k(A_k) = \{(\alpha_1(a_1), \alpha_2(a_2), \dots, \alpha_K(a_K));\ a_k \in A_k,\ k = 1, \dots, K\},$$

which is composed of all the possible record K-tuples in which the kth entry corresponds to
the information recorded for some $a_k$ in the subpopulation k. Now we describe the possible
matching patterns of the record K-tuples in terms of the members of the subpopulations $A_k$.
First, it is possible that a record K-tuple includes information on K different individuals, i.e.,
for some $(\alpha_1(a_1), \alpha_2(a_2), \dots, \alpha_K(a_K))$, $a_k \neq a_{k'}$ for all $k \neq k'$. At the other extreme, if an
individual appears in all K datafiles, then in the record K-tuple $(\alpha_1(a_1), \alpha_2(a_2), \dots, \alpha_K(a_K))$
actually $a_1 = a_2 = \dots = a_K$. In general, we can classify the entries of each record K-tuple
into subsets that record information on the same individual.

In order to establish this idea formally, let $\mathbb{P}_K$ denote the set of partitions of the set
$N_K = \{1, 2, \dots, K\}$. If we associate each number in $N_K$ with an entry in a record K-tuple,
then the matching pattern of each record K-tuple corresponds to a partition of $N_K$, where the
elements of the partition group the entries of the K-tuple that represent the same individual.
Now, let $S_p$ denote the set of record K-tuples corresponding to the matching pattern $p \in \mathbb{P}_K$.
It is clear that

$$\bigotimes_{k=1}^{K} \alpha_k(A_k) = \bigcup_{p \in \mathbb{P}_K} S_p \qquad (1)$$

since each record K-tuple has a unique matching pattern. The number of ways we can
partition a set of K elements into nonempty subsets is called the Kth Bell number, denoted
$B_K$, which can be found using the recurrence relation $B_K = \sum_{k=0}^{K-1} \binom{K-1}{k} B_k$, with $B_0 = 1$
by convention (see Rota, 1964, for further details). Thus, there are $B_K$ subsets $S_p$ of record
K-tuples.
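As an illustration of the counting argument above, the following minimal Python sketch computes Bell numbers with the stated recurrence and enumerates the partitions of {1, ..., K}. The function names (`bell`, `set_partitions`, `label`) are hypothetical helpers introduced here, not part of the original exposition, and the slash-separated labels assume K is at most 9.

```python
from math import comb


def bell(K):
    """Bell number B_K from B_K = sum_{k=0}^{K-1} C(K-1, k) * B_k, with B_0 = 1."""
    B = [1]
    for m in range(1, K + 1):
        B.append(sum(comb(m - 1, k) * B[k] for k in range(m)))
    return B[K]


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Place `first` in each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or in a new block of its own.
        yield [[first]] + part


def label(part):
    """Slash-separated label of a partition, e.g. [[1, 2], [3]] -> '12/3'."""
    blocks = sorted(sorted(b) for b in part)
    return "/".join("".join(str(i) for i in b) for b in blocks)


patterns_3 = {label(p) for p in set_partitions([1, 2, 3])}
print(bell(3), sorted(patterns_3))
```

For K = 3 the enumeration yields the five matching patterns used throughout the examples, and the number of enumerated partitions agrees with `bell(K)` for any small K.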

Let n denote the cardinality of the set in equation (1). Also, for $j = 1, \dots, n$, let
$r_j = (\alpha_1(a_1), \dots, \alpha_K(a_K))$, for some $a_k \in A_k$, $k = 1, \dots, K$, be the jth record K-tuple of the K-ary
product in equation (1). When the datafiles do not contain common identifiers, we cannot
identify the subset $S_p$ to which the record K-tuple $r_j$ belongs. If the datafiles record the same
F fields of information, however, we can obtain a comparison vector $\gamma^j$ for each K-tuple $r_j$.
We can use this information to estimate the probability that each record K-tuple belongs to
each subset $S_p$, given the comparison vector $\gamma^j$. The goal of multiple record linkage is to classify
all the record K-tuples into the appropriate subsets $S_p$.

Example. If we have K = 3 datafiles, for each triplet of records we have the matching
patterns in Table 1, which can be represented using undirected graphs as in Figure 1. In this
case, we also have $B_3 = 5$ and the Cartesian product of the three datafiles can be written as

$$\bigotimes_{k=1}^{3} \alpha_k(A_k) = S_{1/2/3} \cup S_{12/3} \cup S_{13/2} \cup S_{1/23} \cup S_{123}. \qquad (2)$$



Table 1: Each matching pattern of a record triplet can be associated with a partition of the
set {1, 2, 3}.

Notation    Partition in $\mathbb{P}_3$    $(\alpha_1(a_1), \alpha_2(a_2), \alpha_3(a_3))$
1/2/3       {{1},{2},{3}}                  $a_1 \neq a_2 \neq a_3 \neq a_1$
12/3        {{1,2},{3}}                    $a_1 = a_2$; $a_3 \neq a_1, a_2$
13/2        {{1,3},{2}}                    $a_1 = a_3$; $a_2 \neq a_1, a_3$
1/23        {{1},{2,3}}                    $a_2 = a_3$; $a_1 \neq a_2, a_3$
123         {{1,2,3}}                      $a_1 = a_2 = a_3$

















Figure 1: Undirected graphs giving $B_3 = 5$ possible patterns of agreement using
three datafiles. The vertices appear connected if the values that they represent
agree; otherwise, the vertices appear unconnected.

2.1 Blocking 

Note that the dimension of the K-ary product grows exponentially as a function of K. Thus,
considering the complete set of record K-tuples is highly inefficient in most applications. A
common way to deal with this problem in bipartite record linkage is to partition each datafile
into a common set of blocks, thereby eliminating the need to match records in different
blocks. The idea is that reliable categorical fields such as zip code or gender may be used to
quickly label some of the non-links. For example, if we are matching datafiles with geographic
information, we could assign those records that differ in zip code (or a similar field) as
non-links. See Herzog et al. (2007, 2010) and Christen (2012) for a discussion of blocking.

In multiple record linkage we can apply the same idea to assign non-links between pairs
of records within every record K-tuple. If a certain blocking variable assigns a non-link
between records k and k' in the record K-tuple $r_j$, this implies that $r_j$ cannot be assigned
to subsets $S_p$ where the pattern of agreement p involves a link between files k and k'.
Consequently, the record linkage process has to decide among the remaining possibilities. If a
non-link is assigned to every pair of records within a record K-tuple, then this K-tuple can
be assigned directly to the subset $S_{1/2/\dots/K}$ (see notation in Table 1). In practice this last
step tremendously reduces the number of K-tuples to be classified.

Using the natural partial order in $\mathbb{P}_K$ we provide a way to determine the subsets to which
a record K-tuple can be assigned after blocking. We say that $p' \preceq p$ if $p'$ is a partition finer
than or equal to $p$. Note that the blocking process provides a maximal pattern of agreement
$p_b$ for each record K-tuple $r_j$. Thus, the subsets to which $r_j$ can be potentially assigned are
those $S_p$ such that $p \preceq p_b$.
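The refinement order described above can be sketched in Python as follows. This is only an illustration under the assumption that the blocking pattern $p_b$ of a K-tuple is supplied directly as a partition (a list of blocks); the helper names `is_finer` and `admissible_patterns` are hypothetical.

```python
def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part


def label(part):
    """Slash-separated label of a partition, e.g. [[1], [2, 3]] -> '1/23'."""
    blocks = sorted(sorted(b) for b in part)
    return "/".join("".join(str(i) for i in b) for b in blocks)


def is_finer(p, q):
    """True if every block of p is contained in some block of q (p finer than or equal to q)."""
    return all(any(set(b) <= set(c) for c in q) for b in p)


def admissible_patterns(p_b, K):
    """Labels of all patterns p that are finer than or equal to the blocking pattern p_b."""
    everything = set_partitions(list(range(1, K + 1)))
    return sorted(label(p) for p in everything if is_finer(p, p_b))


# Blocking rules out any link involving file 1, so the maximal pattern is 1/23.
after_blocking = admissible_patterns([[1], [2, 3]], K=3)
print(after_blocking)
```

With the blocking pattern 1/23, only the patterns 1/2/3 and 1/23 remain as candidates, while the trivial blocking pattern 123 leaves all five patterns available.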

Example. In Figure 2 we present the Cartesian product of two pairs of files after blocking.
We illustrate using homicide data from the Armenia, Montenegro, and Quimbaya towns in
the Colombian province of Quindio. In this example, only the gray elements of the Cartesian
product become part of the record linkage process, whereas the white elements become a
priori non-matches. The left-hand side of Figure 2 represents the Cartesian product of two
Census and Police data subsets after blocking by town. The right-hand side represents the
Cartesian product of the same Census data subset and a Forensics data subset after blocking
by gender. Note that in this example we assign the pair $(\alpha_1(a), \alpha_2(b))$ as a non-link since
these two records refer to homicides in different towns. We also assign the pair $(\alpha_1(a), \alpha_3(c))$
as a non-link since these two records refer to different genders. Assuming that there are no
non-link blocking assignments for $(\alpha_2(b), \alpha_3(c))$, the multiple record linkage decision process
has to classify the triplet $(\alpha_1(a), \alpha_2(b), \alpha_3(c))$ as either belonging to $S_{1/2/3}$ or $S_{1/23}$. On the
other hand, the two blocking processes illustrated in Figure 2 have no direct implications on
the possible resolution of $(\alpha_1(d), \alpha_2(b), \alpha_3(c))$.

3 COMPARISON DATA



In order to obtain appropriate data to model the probability that a certain record K-tuple
belongs to some subset $S_p$, let us determine the matching pattern for each common field of
recorded information. If for a certain record K-tuple we search for agreement among the
information recorded for a certain field, we can associate each entry of the record K-tuple
with a number in {1, 2, ..., K}, and a certain partition of this set describes the matching
pattern of the record K-tuple for the field in consideration, grouping in the same element of
the partition all the K-tuple entries that agree in the field being compared (similar to Section
2). An alternative way to explain this idea is as follows. For some record K-tuple, let us
compare the information of the records from the datafiles k, k', and k'' for a certain common
field. Due to transitivity of agreement, if records k and k' agree and k' and k'' agree, then
k and k'' necessarily agree. Thus, since agreement is an equivalence relation, each matching
pattern for each field for each record K-tuple is a partition of K points, because for any
equivalence relation on a set, the set of its equivalence classes (sets of records agreeing) is a
partition of the set.

Now, let $\gamma^{jf}_p = 1$ if the record K-tuple $r_j$ has the matching pattern p in the field f, and
$\gamma^{jf}_p = 0$ otherwise. Then, for each field $f = 1, \dots, F$ of each record K-tuple $r_j$, we obtain a vector
$\gamma^{jf} = (\gamma^{jf}_{1/2/\dots/K}, \dots, \gamma^{jf}_p, \dots, \gamma^{jf}_{12\dots K})$, where only one entry is equal to one and the rest are equal
to zero. Note the length of the vector $\gamma^{jf}$ is $B_K$, since this is the number of patterns of
agreement for each field. Finally, the comparison data for $r_j$ contains the comparison vectors
for all the F fields, and can be written as $\gamma^j = (\gamma^{j1}, \dots, \gamma^{jf}, \dots, \gamma^{jF})$, which takes values over
$(B_K)^F$ possible matching patterns.

Figure 2: Cartesian products of Census and Police homicide data after blocking
by town (left), and Census and Forensics homicide data after blocking by gender
(right) for three towns in Colombia. Only elements in gray blocks are potentially
linked. Black elements are discussed in the example of Section 2.1.

As in Section 2, we can represent the patterns of agreement presented in this
section by unions of complete undirected graphs (see Rosen, 2007, p. 448) as in Figure 1. In
those graphs, each vertex represents the value of a certain field in a certain record that belongs
to a certain datafile $k = 1, \dots, K$. The vertices k' and k appear connected if the values that
they represent agree; otherwise, the vertices appear disconnected.

Example. Let us show how the comparison data work when we need to link three
datafiles. In this case, we can represent the patterns of agreement as five unions of complete
undirected graphs, as presented in Figure 1. For K = 3, $\gamma^{jf} = (\gamma^{jf}_{1/2/3}, \gamma^{jf}_{12/3}, \gamma^{jf}_{13/2}, \gamma^{jf}_{1/23}, \gamma^{jf}_{123})$
represents the comparison data for the field f (say age, ethnicity, etc.) of the record triplet
$r_j$, and the length of the full comparison data for each record triplet is 5F, if the datafiles
have F common fields.
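The construction of the comparison data for a record triplet can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes exact equality as the agreement criterion, no missing values, and K at most 9 for the labels; the field values and the names `agreement_pattern` and `comparison_data` are hypothetical.

```python
# Ordering of the five patterns for K = 3, as in Table 1.
PATTERNS_3 = ["1/2/3", "12/3", "13/2", "1/23", "123"]


def agreement_pattern(values):
    """Label of the partition of {1, ..., K} induced by equal field values."""
    groups = {}
    for k, v in enumerate(values, start=1):
        groups.setdefault(v, []).append(k)
    return "/".join("".join(str(k) for k in g) for g in sorted(groups.values()))


def comparison_data(records, patterns=PATTERNS_3):
    """One indicator vector per field; exactly one entry of each vector equals 1."""
    n_fields = len(records[0])
    gamma = []
    for f in range(n_fields):
        observed = agreement_pattern([rec[f] for rec in records])
        gamma.append([1 if q == observed else 0 for q in patterns])
    return gamma


# A hypothetical record triplet with F = 3 fields: age, gender, town.
triplet = [(34, "M", "Armenia"), (34, "M", "Armenia"), (35, "M", "Armenia")]
gamma_j = comparison_data(triplet)
print(gamma_j)
```

Because equality is transitive, the groups of agreeing entries always form a partition, so each field contributes a single indicator among the $B_3 = 5$ patterns and the full vector has length 5F.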



4 MODEL FOR MATCHING PROBABILITIES 



The probabilities $P(S_p \mid \gamma^j) = P(r_j \in S_p \mid \gamma^j)$, $p \in \mathbb{P}_K$, can be found using
$P(\gamma^j \mid S_p) = P(\gamma^j \mid r_j \in S_p)$ and $P(S_p) = P(r_j \in S_p)$, as
$P(S_p \mid \gamma^j) = P(\gamma^j \mid S_p) P(S_p) / P(\gamma^j)$, where

$$P(\gamma^j) = \sum_{p \in \mathbb{P}_K} P(\gamma^j \mid S_p) P(S_p).$$

Let $g^j = (g^j_{1/2/\dots/K}, \dots, g^j_{12\dots K})$ be the vector that indicates the subset $S_p$ that contains the
record K-tuple $r_j$, such that $g^j_p = 1$ if $r_j \in S_p$ and $g^j_p = 0$ otherwise. Thus, it is clear that
$\sum_{p \in \mathbb{P}_K} g^j_p = 1$. Let $x^j = (g^j, \gamma^j)$ be the (partially observed) complete data vector for $r_j$.
Note that after blocking, some entries of $g^j$ are fixed as zeroes for some record K-tuples.
Winkler (1988), Jaro (1989), and Larsen and Rubin (2001) proposed to model the
corresponding complete data for bipartite record linkage, where $g^j$ is taken as a latent variable.
For multiple record linkage, the model for $x^j$ is stated as

$$P(x^j \mid \Phi) = \prod_{p \in \mathbb{P}_K} \left[ P(\gamma^j \mid S_p) P(S_p) \right]^{g^j_p}.$$

Under the conditional independence assumption of the comparison data fields, we obtain

$$P(\gamma^j \mid S_p) = \prod_{f=1}^{F} P(\gamma^{jf} \mid S_p). \qquad (3)$$

Each $\gamma^{jf}$ represents the matching pattern of $r_j$ in the field f, which corresponds to categorical
information that can be modeled by using a categorical distribution (or multinomial with just
one trial) as

$$P(\gamma^{jf} \mid S_p) = \prod_{p' \in \mathbb{P}_K} \left( \pi^f_{p'|p} \right)^{\gamma^{jf}_{p'}}, \qquad (4)$$

where $\pi^f_{p'|p} = P(\gamma^{jf}_{p'} = 1 \mid S_p)$, and $p'$ is just another index of the patterns of agreement
in $\mathbb{P}_K$. Defining $s_p = P(S_p)$, under independence of the complete data, the complete
log-likelihood for the sample $x = \{x^j;\ j = 1, \dots, n\}$ is obtained as

$$L = \sum_{j=1}^{n} \sum_{p \in \mathbb{P}_K} g^j_p \left[ \log s_p + \sum_{f=1}^{F} \sum_{p' \in \mathbb{P}_K} \gamma^{jf}_{p'} \log \pi^f_{p'|p} \right].$$



The set of parameters in the log-likelihood above is $\Phi = (s, \Pi)$, where s is a vector of length
$B_K$ given by $s = (s_{1/2/\dots/K}, \dots, s_{12\dots K})$ and $\Pi$ can be arranged in a set of F matrices of size
$B_K \times B_K$, each one given by

$$\Pi^f = \begin{pmatrix}
\pi^f_{1/2/\dots/K|1/2/\dots/K} & \cdots & \pi^f_{1/2/\dots/K|p} & \cdots & \pi^f_{1/2/\dots/K|12\dots K} \\
\vdots & & \vdots & & \vdots \\
\pi^f_{p'|1/2/\dots/K} & \cdots & \pi^f_{p'|p} & \cdots & \pi^f_{p'|12\dots K} \\
\vdots & & \vdots & & \vdots \\
\pi^f_{12\dots K|1/2/\dots/K} & \cdots & \pi^f_{12\dots K|p} & \cdots & \pi^f_{12\dots K|12\dots K}
\end{pmatrix}$$

for $f = 1, \dots, F$. Hence, the length of $\Phi$ is $B_K(B_K F + 1)$. In order to estimate these
probabilities, since the vectors $x^j$ are only partially observed, the estimation is made via
maximum likelihood using the EM algorithm (Dempster et al., 1977). The model presented
in this section generalizes the one used by Winkler (1988) and Jaro (1989), and uses the strong
assumption that the comparison data fields are conditionally independent given the K-tuples'
membership to the subsets $S_p$. In Section 7 we show that this baseline model produces good
results for the Colombian homicide data, but the modeling of the fields' dependencies may
be a key factor in obtaining good linkage results in other contexts (see Larsen and Rubin,
2001). This is part of our ongoing work.

Example. For the particular case where K = 3, the length of $\Phi$ is 5 + 25F, which is given
by $s = (s_{1/2/3}, s_{12/3}, s_{13/2}, s_{1/23}, s_{123})$ and $\Pi$, which is composed of F matrices of size $5 \times 5$,
as

$$\Pi^f = \begin{pmatrix}
\pi^f_{1/2/3|1/2/3} & \pi^f_{1/2/3|12/3} & \pi^f_{1/2/3|13/2} & \pi^f_{1/2/3|1/23} & \pi^f_{1/2/3|123} \\
\pi^f_{12/3|1/2/3} & \pi^f_{12/3|12/3} & \pi^f_{12/3|13/2} & \pi^f_{12/3|1/23} & \pi^f_{12/3|123} \\
\pi^f_{13/2|1/2/3} & \pi^f_{13/2|12/3} & \pi^f_{13/2|13/2} & \pi^f_{13/2|1/23} & \pi^f_{13/2|123} \\
\pi^f_{1/23|1/2/3} & \pi^f_{1/23|12/3} & \pi^f_{1/23|13/2} & \pi^f_{1/23|1/23} & \pi^f_{1/23|123} \\
\pi^f_{123|1/2/3} & \pi^f_{123|12/3} & \pi^f_{123|13/2} & \pi^f_{123|1/23} & \pi^f_{123|123}
\end{pmatrix}.$$
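To make the model concrete, the sketch below evaluates the matching probabilities $P(S_p \mid \gamma^j)$ under the conditional independence assumption for K = 3. It is an illustration only: the parameter values are made up, the comparison data are encoded as the index of the observed pattern in each field (equivalent to the indicator vectors), and the names `posterior` and `n_parameters` are hypothetical.

```python
PATTERNS = ["1/2/3", "12/3", "13/2", "1/23", "123"]


def n_parameters(B, F):
    """Length of the parameter vector: B entries of s plus F matrices of size B x B."""
    return B * (B * F + 1)


def posterior(gamma, s, Pi, allowed=None):
    """P(S_p | gamma) for every pattern p.

    gamma[f]    : index of the agreement pattern observed in field f
    s[p]        : P(S_p)
    Pi[f][q][p] : P(field f shows pattern q | S_p), rows q and columns p
    allowed     : indices of patterns not excluded by blocking (default: all)
    """
    B = len(s)
    allowed = range(B) if allowed is None else allowed
    num = [0.0] * B
    for p in allowed:
        lik = 1.0
        for f, q in enumerate(gamma):
            lik *= Pi[f][q][p]  # conditional independence across fields
        num[p] = lik * s[p]
    total = sum(num)
    return [x / total for x in num]


# Made-up parameters for K = 3 and F = 2; each column of each matrix sums to one.
B, F = 5, 2
Pi = [[[0.6 if q == p else 0.1 for p in range(B)] for q in range(B)] for _ in range(F)]
s = [0.90, 0.03, 0.03, 0.03, 0.01]

post = posterior([4, 4], s, Pi)                         # both fields agree in all files
post_blocked = posterior([4, 4], s, Pi, allowed=[0, 3])  # blocking pattern 1/23
print(n_parameters(B, F), post, post_blocked)
```

Entries of the result that correspond to patterns excluded by blocking are exactly zero, and the remaining entries are renormalized over the admissible patterns.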

5 EM ESTIMATION 

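The EM algorithm can be used to fit the parameters of a mixture model via maximum
likelihood estimation (see Dempster et al., 1977; McLachlan and Peel, 2000, p. 47; and
Winkler, 1988; Jaro, 1989; Larsen and Rubin, 2001). Following the model presented in
Section 4, let us find the equations of an EM algorithm to estimate $\Phi$. Firstly, for the
Expectation step, let us find the conditional distribution of $g^j$ given $\gamma^j$:

$$P(g^j \mid \gamma^j) = \frac{P(x^j)}{P(\gamma^j)} = \prod_{p \in \mathbb{P}_K} \left[ \frac{P(\gamma^j \mid S_p) P(S_p)}{P(\gamma^j)} \right]^{g^j_p} = \prod_{p \in \mathbb{P}_K} \left[ P(S_p \mid \gamma^j) \right]^{g^j_p},$$

i.e., $g^j \mid \gamma^j \sim \text{Multinomial}(1, P \mid \gamma^j)$, where $P \mid \gamma^j = (P(S_{1/2/\dots/K} \mid \gamma^j), \dots, P(S_{12\dots K} \mid \gamma^j))$.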



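Thus, using the estimate $\hat{\Phi}$ from a previous M step of the algorithm, for the E step, the
expectation of the unknown part of $g^j$ is composed of

$$\hat{g}^j_p = \frac{\hat{s}_p \prod_{f=1}^{F} \prod_{p' \in \mathbb{P}_K} \left( \hat{\pi}^f_{p'|p} \right)^{\gamma^{jf}_{p'}}}{\hat{P}(\gamma^j)} \qquad (5)$$

for $p \preceq p_b$, where $p_b$ represents the blocking pattern for $r_j$. The term $\hat{P}(\gamma^j)$ above is given
by

$$\hat{P}(\gamma^j) = \sum_{p \preceq p_b} \hat{s}_p \prod_{f=1}^{F} \prod_{p' \in \mathbb{P}_K} \left( \hat{\pi}^f_{p'|p} \right)^{\gamma^{jf}_{p'}}.$$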



Let $\hat{g}^j_p$ be equal to 0 for the entries that are known to be zeroes, and let the remaining
entries of $\hat{g}^j$ be filled with the values given in equation (5). For the Maximization step, we
replace $g^j$ with $\hat{g}^j$ in the log-likelihood L and estimate $\Phi$ via maximum likelihood. We obtain
for $\Pi$

$$\hat{\pi}^f_{p'|p} = \frac{\sum_{j} \hat{g}^j_p \, \gamma^{jf}_{p'} \, n_{\gamma^j}}{\sum_{j} \hat{g}^j_p \, n_{\gamma^j}},$$

and for s we obtain

$$\hat{s}_p = \frac{\sum_{j} \hat{g}^j_p \, n_{\gamma^j}}{n},$$

where the sums run over the distinct observed patterns and $n_{\gamma^j}$ represents the frequency
count of each pattern $\gamma^j$, as in Jaro (1989). Note that
in this case we have $(B_K)^F$ different patterns of $\gamma^j$. As usual, the algorithm stops when
the values of $\hat{\Phi}$ converge, which can be assessed by measuring the distance between $\hat{\Phi}$ in two
consecutive iterations. In order to start this algorithm, we choose initial values taking into
account the fact that some probabilities must be greater than others.
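The E and M steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: sums are taken over individual K-tuples (equivalent to giving every pattern a frequency count of one per occurrence), comparison data are encoded as pattern indices per field, blocking is passed as the list of admissible pattern indices for each K-tuple, and `em_fit` and the toy data are hypothetical. The toy run uses the bipartite case ($B_2 = 2$ patterns, "1/2" and "12") to stay small.

```python
def em_fit(gammas, allowed, s, Pi, max_iter=500, tol=1e-10):
    """EM for the conditional-independence mixture model.

    gammas[j][f] : index of the pattern observed in field f for K-tuple j
    allowed[j]   : pattern indices still admissible for K-tuple j after blocking
    s, Pi        : starting values; Pi[f][q][p] = P(pattern q in field f | S_p)
    Returns the estimates of s and Pi, and the weights from the last E step.
    """
    B, F, n = len(s), len(Pi), len(gammas)
    G = []
    for _ in range(max_iter):
        # E step: expected membership indicators, zero where blocking forbids p.
        G = []
        for gam, ok in zip(gammas, allowed):
            num = [0.0] * B
            for p in ok:
                w = s[p]
                for f in range(F):
                    w *= Pi[f][gam[f]][p]
                num[p] = w
            total = sum(num)
            G.append([x / total for x in num])
        # M step: weighted relative frequencies.
        mass = [sum(g[p] for g in G) for p in range(B)]
        new_s = [m / n for m in mass]
        new_Pi = []
        for f in range(F):
            mat = [[0.0] * B for _ in range(B)]
            for g, gam in zip(G, gammas):
                for p in range(B):
                    mat[gam[f]][p] += g[p]
            for q in range(B):
                for p in range(B):
                    mat[q][p] = mat[q][p] / mass[p] if mass[p] > 0 else Pi[f][q][p]
            new_Pi.append(mat)
        # Largest absolute change between consecutive parameter values.
        delta = max(abs(a - b) for a, b in zip(new_s, s))
        for f in range(F):
            for q in range(B):
                for p in range(B):
                    delta = max(delta, abs(new_Pi[f][q][p] - Pi[f][q][p]))
        s, Pi = new_s, new_Pi
        if delta < tol:
            break
    return s, Pi, G


# Toy bipartite data with F = 3 fields: pattern 0 is "1/2", pattern 1 is "12".
gammas = [[0, 0, 0]] * 90 + [[1, 1, 1]] * 8 + [[1, 0, 0], [0, 1, 1]]
allowed = [[0, 1]] * len(gammas)
s0 = [0.9, 0.1]
Pi0 = [[[0.9, 0.1], [0.1, 0.9]] for _ in range(3)]
s_hat, Pi_hat, G = em_fit(gammas, allowed, s0, Pi0)
print(s_hat)
```

In practice one would run such a routine from several starting values satisfying the constraints of Section 5.1 and keep the solution with the largest marginal likelihood.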



5.1 Starting Values 

Note that the parameters in each $\Pi^f$ should satisfy certain restrictions. In record linkage
these constraints are taken into account in order to start the EM algorithm (Winkler, 1993;
Lahiri and Larsen, 2005). For instance, it is clear that $\pi^f_{12\dots K|12\dots K}$ should be greater than
$\pi^f_{1/2/\dots/K|12\dots K}$, given that in a record K-tuple where all the entries refer to the same individual,
the probability that their information agree should be larger than the probability that all
their information disagree. However, note that $\pi^f_{1/2/\dots/K|1/2/\dots/K}$ should not necessarily be
greater than $\pi^f_{12\dots K|1/2/\dots/K}$, since in a record K-tuple in which all the entries refer to different
individuals, the probability that all their information disagree is not necessarily larger than
the probability that all their information agree (this is the case for a field with a very common
value).

Thus, given the high number of parameters it is not easy to determine which constraints
should be taken into account. In order to determine the set of constraints to start the
algorithm, we present a method that uses the natural partial order in $\mathbb{P}_K$. Remember that
we say $p' \preceq p$ if $p'$ is a partition finer than or equal to $p$. In order to determine if $\pi^f_{p'|p}$ should
be greater or lower than $\pi^f_{p''|p}$ for $p, p', p'' \in \mathbb{P}_K$, we fix the partition $p$ and for all partitions
$p', p''$ such that $p'' \preceq p' \preceq p$ we set $\pi^f_{p''|p} \leq \pi^f_{p'|p}$. In any other case we do not have a criterion
to order $\pi^f_{p'|p}$ with respect to $\pi^f_{p''|p}$.

Note that this procedure can be visualized using a directed graph in the following way:

1. Construct the Hasse diagram of the partitions $p' \in \mathbb{P}_K$, writing in each node $\pi^f_{p'|p}$, where
p is a generic partition.

2. Assign a specific partition to the generic p.

3. Search for the node where $p' = p$.

4. For all the branches under this node, set an inequality $\geq$ between each "father" node
and each "son" node.

5. Repeat steps 2-4 until exhausting the possible partitions.

We can use similar ideas to identify the constraints for $s_{1/2/\dots/K}, \dots, s_{12\dots K}$. We simply
have that $s_{p'} \geq s_p$ whenever $p' \preceq p$. Naturally, the set of inequalities among the probabilities
$s_p$ can also be represented in a Hasse diagram. Furthermore, if the datafiles being linked have
no duplicates, the size of the complete links set $S_{12\dots K}$ should be smaller than or equal to
the smallest datafile size, from which it is reasonable to take starting values for $s_{12\dots K}$ smaller
than $\min\{m_k;\ k = 1, \dots, K\}/n$, where $m_k$ represents the number of records in datafile k. In
general we can determine the maximum size of any set $S_p$ if we assume no duplicates within each
datafile. Denote by $q_p$ a generic element of the partition $p \in \mathbb{P}_K$, i.e., $q_p$ is a subset of $N_K$.
Thus, the maximum size of $S_p$ is $\prod_{q_p \in p} \min\{m_k;\ k \in q_p\}$, from which it is reasonable to start the
algorithm taking values lower than $\prod_{q_p \in p} \min\{m_k;\ k \in q_p\}/n$ for a generic $s_p$. The starting
value for $s_{1/2/\dots/K}$ is determined as one minus the other $s_p$. Notice that since duplicates are
rather common in practice, the above values are merely a guide to start the EM algorithm.
Finally, since latent class models have multiple solutions corresponding to local maxima of
the marginal likelihood, in practice we would take different starting values satisfying the above
constraints, and we would choose the parameters with the maximum marginal likelihood for
the observed data $\gamma^j$ (e.g., see McLachlan and Peel, 2000).



Example. We illustrate this procedure for K = 3 using the left-hand side of Figure 3.
Go to the left panel of Figure 3 and replace p with 123. Since $\pi^f_{123|123}$ is the top node of the
graph, we take the set of constraints $\pi^f_{123|123} \geq \pi^f_{12/3|123} \geq \pi^f_{1/2/3|123}$; $\pi^f_{123|123} \geq \pi^f_{13/2|123} \geq
\pi^f_{1/2/3|123}$; $\pi^f_{123|123} \geq \pi^f_{1/23|123} \geq \pi^f_{1/2/3|123}$, which correspond to the three different branches
under $\pi^f_{123|123}$. Now, replace p with 12/3. Since in this case the node $\pi^f_{12/3|12/3}$ has only one
descendant, we only get the constraint $\pi^f_{12/3|12/3} \geq \pi^f_{1/2/3|12/3}$. This step is similar for 13/2
and 1/23. Finally, if we replace p with 1/2/3 we can see that the node $\pi^f_{1/2/3|1/2/3}$ does not
have descendants, so we do not set constraints for the probabilities $\pi^f_{p'|1/2/3}$, $p' \in \mathbb{P}_3$.
For K = 3, the right-hand side of Figure 3 represents the set of inequalities for the starting
values of $s_p$. We obtain for instance $s_{1/2/3} \geq s_{1/23} \geq s_{123}$. Also, for this particular case we
take $s_{123} < \min\{m_1, m_2, m_3\}/n$, $s_{1/23} < m_1 \min\{m_2, m_3\}/n$, and similar inequalities for
$s_{13/2}$ and $s_{12/3}$, whereas $s_{1/2/3} = 1 - s_{1/23} - s_{13/2} - s_{12/3} - s_{123}$.
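The constraints of this example can be generated mechanically from the refinement order. The sketch below is only an illustration for K = 3 with hypothetical helper names and made-up datafile sizes; it lists every comparable pair of patterns, which includes the pairs implied by transitivity in addition to the "father" and "son" pairs of the Hasse diagram, and it assumes n is the product of the datafile sizes (no blocking).

```python
from fractions import Fraction
from math import prod

# The five partitions of {1, 2, 3}, keyed by the labels of Table 1.
PARTS_3 = {
    "1/2/3": [[1], [2], [3]],
    "12/3": [[1, 2], [3]],
    "13/2": [[1, 3], [2]],
    "1/23": [[1], [2, 3]],
    "123": [[1, 2, 3]],
}


def is_finer(p, q):
    """True if partition p is finer than or equal to partition q."""
    return all(any(set(b) <= set(c) for c in q) for b in p)


def pi_constraints(p, parts=PARTS_3):
    """Pairs (p1, p2) with p2 strictly finer than p1 and p1 finer than or equal to p.

    Each pair stands for the starting-value constraint pi_{p2|p} <= pi_{p1|p}.
    """
    out = []
    for a, pa in parts.items():
        if not is_finer(pa, parts[p]):
            continue
        for b, pb in parts.items():
            if a != b and is_finer(pb, pa):
                out.append((a, b))
    return out


def s_upper_bound(p, sizes, parts=PARTS_3):
    """Upper bound for s_p without duplicates: prod over blocks of min m_k, divided by n."""
    n = prod(sizes.values())
    return Fraction(prod(min(sizes[k] for k in block) for block in parts[p]), n)


sizes = {1: 10, 2: 20, 3: 30}  # made-up numbers of records m_1, m_2, m_3
print(pi_constraints("12/3"))
print({p: s_upper_bound(p, sizes) for p in PARTS_3})
```

For p = 123 the function returns seven pairs (the six edges of the three branches plus the pair 123 and 1/2/3 implied by transitivity), for p = 12/3 a single pair, and for p = 1/2/3 none, in agreement with the example.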




Figure 3: Hasse diagrams to determine the set of inequalities between
probabilities $\pi^f_{p'|p}$ and $s_p$. The possible inequalities are established
from sources to targets in the arrows, e.g., $s_{123} \leq s_{12/3}$.



6 LINKAGE ASSIGNMENT: GENERALIZED 
FELLEGI-SUNTER DECISION RULE 

The goal of multiple record linkage is to classify each record K-tuple to the appropriate
subset $S_p$. For bipartite record linkage, Fellegi and Sunter (1969) proposed the computation
of likelihood ratios as weights for the assignment of record pairs as matched or unmatched
pairs. Their procedure is equivalent to testing the hypothesis that each record pair belongs to
the subset of unmatched record pairs, against the hypothesis that it belongs to the subset of
matched pairs, and vice versa.

6.1 Likelihood Ratios and Weights

In multiple record linkage, there are several subsets of records denoting all the possibilities
of matching between records from different datafiles. Following Fellegi and Sunter's idea,
for each record K-tuple and for each subset, we propose to compute weights following a
hypothesis test, where the null hypothesis is the record K-tuple membership to a certain
subset, i.e., $r_j \in S_p$, against the hypothesis that this record K-tuple does not belong to the
subset, i.e., $r_j \in S_p^c$, where the superscript c denotes the complement of the set. By using a
log-likelihood ratio we obtain

$$w^j_p = \log \frac{P(\gamma^j \mid S_p)}{P(\gamma^j \mid S_p^c)}.$$

The informal idea of the use of the weights $w^j_p$ is that we would order the record K-tuples
according to their respective weights and we would assign K-tuples with large $w^j_p$ to the
subset $S_p$. However, the ordering obtained from $w^j_p$ can be obtained in a simpler way, regardless
of the model for $P(\gamma^j \mid S_p)$.

Proposition 1. The ordering obtained from $w^j_p$, $\text{logit}[P(S_p \mid \gamma^j)]$ and $P(S_p \mid \gamma^j)$ is the same.

Thus, for ordering and decision purposes we can simply use $P(S_p \mid \gamma^j)$ (see proofs in
Appendix A). We still need to determine, however, the cutoffs from which we declare record
K-tuples' memberships.
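The following short derivation sketches why the three orderings coincide. It is an illustration rather than the proof given in the Appendix, and it assumes that the weight is the log-likelihood ratio $w^j_p = \log[P(\gamma^j \mid S_p) / P(\gamma^j \mid S_p^c)]$ and that $s_p = P(S_p)$:

```latex
\begin{align*}
P(\gamma^j \mid S_p^c)
  &= \frac{P(\gamma^j) - P(\gamma^j \mid S_p)\, s_p}{1 - s_p}, \\
w^j_p
  &= \log \frac{P(\gamma^j \mid S_p)\, s_p}{P(\gamma^j) - P(\gamma^j \mid S_p)\, s_p}
     + \log \frac{1 - s_p}{s_p} \\
  &= \operatorname{logit}\left[ P(S_p \mid \gamma^j) \right] - \operatorname{logit}(s_p).
\end{align*}
```

Since $\operatorname{logit}(s_p)$ does not depend on j and the logit function is strictly increasing, sorting the K-tuples by the weight, by the logit of the matching probability, or by the matching probability itself gives the same order for a fixed pattern p.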



6.2 Cutoff Values 

In bipartite record linkage, in order to declare a record pair as matched or unmatched, the
Fellegi-Sunter method orders the possible values of $\gamma^j$ by their weights in non-increasing
order, determines two cutoff values of the weights, and, according to them, declares matches
and non-matches. For multiple record linkage, we extend this procedure and prove its
optimality.



Theorem 1. The decision procedure described below maximizes the probability of assigning
each record K-tuple to the right subset, subject to a set of admissible error levels $\mu_p$.

1. Each record K-tuple is potentially declared to belong to the subset $S_p$ if and only if p is
the pattern for which $P(S_p \mid \gamma^j)$ is maximum among all possible patterns in $\mathbb{P}_K$. Thus,
the set of record K-tuples is partitioned into $B_K$ parts, and for each record K-tuple
in one of these parts we consider only two possibilities: whether to declare it to
belong to the subset $S_p$ or to keep it undeclared.

2. For the record K-tuples in each part, we order the possible values of $\gamma^j$ by their
weights (or equivalently by $P(S_p \mid \gamma^j)$) in non-increasing order, indexing by the subscript
$(j)_p$.
3. We find one value {j')p for each set of weights related to each subset, in order to 
determine the record K-tuple memberships. The value {j')p is found such that 

(j')p-i 

where ^ip = P(assign rj the membership of Sp\rj G Sp) is an admissible error level. 
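Each $P(\gamma^{(j)_p} \mid S_p^c)$ can be computed as

$$ P(\gamma^{(j)_p} \mid S_p^c) = \frac{1}{1 - P(S_p)} \sum_{q \neq p} P(\gamma^{(j)_p} \mid S_q) P(S_q). $$
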

4. Finally, for those record $K$-tuples with configurations $\gamma^{(j)_p}$, $(j)_p = 1, \ldots, (j')_p - 1$, we
decide that they belong to the subset $S_p$. For those record $K$-tuples with configurations
$\gamma^{(j)_p}$ with $(j)_p \geq (j')_p$, we keep them undeclared.
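
The four steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the tuple layout `(label, posterior, lik_spc)`, and the numbers in the example are hypothetical. Because the equality defining the cutoff may not be attainable with finitely many configurations, the sketch declares the longest prefix of the ordering whose accumulated $P(\gamma \mid S_p^c)$ does not exceed $\mu_p$.

```python
def best_pattern(posteriors):
    """Step 1: pattern p with the largest P(S_p | gamma) for one K-tuple."""
    return max(posteriors, key=posteriors.get)

def declare_members(configs, mu_p):
    """Steps 2 to 4 for one partition.

    configs: list of (label, posterior, lik_spc), where posterior is
    P(S_p | gamma) and lik_spc is P(gamma | S_p^c) for that configuration.
    Returns (declared_labels, undeclared_labels).
    """
    # Step 2: order configurations by posterior, non-increasing.
    ordered = sorted(configs, key=lambda c: -c[1])
    declared, undeclared = [], []
    cumulative_error = 0.0
    cutoff_reached = False
    for label, _post, lik_spc in ordered:
        # Step 3: stop declaring once the error budget mu_p would be exceeded.
        if not cutoff_reached and cumulative_error + lik_spc <= mu_p:
            cumulative_error += lik_spc
            declared.append(label)   # Step 4: assign to S_p
        else:
            cutoff_reached = True
            undeclared.append(label)  # Step 4: keep undeclared
    return declared, undeclared

# Hypothetical configurations for one partition.
example = [("g1", 0.99, 0.002), ("g2", 0.95, 0.005),
           ("g3", 0.70, 0.010), ("g4", 0.55, 0.001)]
print(declare_members(example, mu_p=0.01))
```

Once the cutoff is reached, every later configuration stays undeclared even if its own $P(\gamma \mid S_p^c)$ is small, which keeps the declared set a prefix of the ordering as the procedure requires.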

In the Appendix we show that the above decision rule is optimal under the availability of
the true matching probabilities. We show that this decision rule minimizes the probability
of assigning each record $K$-tuple to the wrong subset $S_p$ or keeping it undeclared, subject to
a set of admissible error levels $\mu_p$; equivalently, it maximizes the probability of assigning
each record $K$-tuple to the right subset, subject to the same admissible error levels. The
Fellegi-Sunter decision rule for bipartite record linkage can be obtained as a corollary of
Theorem 1. In practice the optimality of this decision rule depends on the quality of the
estimation of the matching probabilities. Belin and Rubin (1995) and Larsen and Rubin (2001)
provide evidence that nominal and actual error levels disagree in different applications. Belin
and Rubin (1995) proposed a method to calibrate error rates as a function of cutoff values for
bipartite record linkage. This is an important problem that we expect to address in our
ongoing work for the multiple record linkage context.
7 LINKING HOMICIDE RECORD-SYSTEMS IN COLOMBIA



The Colombian homicide data described in Section 1.2 was provided by the Conflict Analysis 
Resource Center (CERAC) where a linkage by hand was performed for a subset of the data, 
corresponding to the province of Quindio for the last three months of 2004. In this section we 
present an application to the integration of these three datafiles. In this period, 67, 62, and 
33 homicides were recorded by the Census Bureau, the National Police, and the Forensics 
Institute, respectively. The common fields of these three datafiles are town and date of the 
homicide, gender, and age of the victim. 

An outline of the implementation of the method is as follows: 

1. Find the set of record triplets that are suitable for classification into the different
matching patterns. This set is obtained after blocking.

2. Compute the comparison data according to the possible patterns of agreement for all
the triplets to be classified and for every common field.

3. Train the mixture model of the distribution of the comparison data.

4. Divide the set of triplets according to the subsets $S_p$ for which $P(S_p \mid \gamma^j)$ is maximum.

5. Within each subset, sort the triplets by $P(S_p \mid \gamma^j)$ and use an admissible error level to
either declare the triplets as belonging to the subset $S_p$ or keep them undeclared.

In order to implement the method, we used town of the homicide and gender of the victim
for blocking. We assigned membership in the subset $S_{1/2/3}$ to the triplets with blocking
pattern 1/2/3. We used the proposed method to classify the remaining triplets. In order
to use date of the homicide and age of the victim, we explored several options, but we only
report the results of using three of them (Table 2). The first option only includes exact
comparison data for both variables. The second option constructs three categorical variables
from each of the variables age and date, and creates comparison data using these new
categorical variables. These variables are constructed in the following fashion: the categories
of the variable AgeA are 0-2, 3-5, and so on; the categories of the variable AgeB are 0, 1-3, 4-6,
and so on; and finally, the categories of the variable AgeC are 0-1, 2-4, and so on. A similar
procedure is used for date of the homicide, starting from the first day of the period of the
data. The third approach uses the previous categorical variables and in addition exploits
a specific structure of the age recorded in these datasets in order to create an additional
blocking variable. The ages recorded in these three datafiles present two gaps, that is, there
are no homicides recorded in the 5-11 and 56-65 age intervals. Thus, we create a new blocking
variable that classifies "kids", "young", and "elderly" individuals. We think it is safe to use
this variable for blocking since no records with similar ages are assigned to different blocks.
Also, to help the EM algorithm identify the appropriate clusters, we replaced $P(S_p \mid \gamma^j)$ by
1 for those triplets with $\gamma^j_{fp} = 1$ for all the fields $f$ and for $p \in \{12/3, 13/2, 1/23, 123\}$. This
semi-supervised approach is a missing data problem under multinomial sampling (Dempster
et al., 1977). We made the final assignments using nominal error levels $\mu_p = 0.01$ for all $p$.



Table 2: Error rates of multiple record linkage assignments for Census (1) - Forensics
(2) - Police (3) record triplets. Three comparison data options for age of the victim
and date of the homicide. OME: Overall Misclassification Error, MWGE: Mean
Within Group Error.

                                     Misclassification Error
Age and Date Data                    1/2/3    12/3     13/2     1/23     123      OME      MWGE
1. Exact comparisons                 0.6203   0.2216   0.3915   0.0079   0.4444   0.5977   0.3371
2. Three comparison categories       0.0470   0.0109   0.0803   0.0510   0.0370   0.0471   0.0453
3. Three comparison categories +     0.0365   0.0079   0.0598   0.0082   0.0370   0.0359   0.0299
   blocking Kid-Young-Elderly


In Table 2 we present different measures of the performance of the multiple record linkage
decisions using the three different options for the inclusion of the information about age of
the victim and date of the homicide. These measures were obtained after comparing with the
results of the hand matching procedure, which is thought to be more reliable. Besides the
usual misclassification errors, we present the mean within group error rate (Qiao and Liu,
2009), which controls for the different sizes of the clusters $S_p$ by taking the average of the
error rates for each $S_p$. From the first age and date comparison data, we can see that the
multiple record linkage procedure can produce catastrophic results if it is not used carefully.
For this scenario nearly all the misclassification errors are very high, which indicates that the
multiple record linkage process did not find the appropriate clusters. For the first comparison
data only exact comparisons were included, hence small differences in age and date were
treated the same as large differences. For the second age and date comparison data the results
improved substantially. The way these comparison variables were created is such that if there
is exact agreement in age or in date, the three corresponding comparison variables agree. If
there is a difference of one unit, two of them agree, and if there is a difference of two units,
only one of the variables agrees. This approach is more flexible to capture small measurement
error in age and date. The final approach additionally blocks three categories of age, which
helps to reduce the number of misclassified triplets. For this final approach all the measures
of misclassification error are very close to zero, which indicates that multiple record linkage
can provide good results if used properly. Naturally, the good performance of the method
depends on the specific datafiles to be linked and the models implemented.
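
The agreement counts produced by the three shifted categorical variables can be verified with a short Python sketch. The function names are hypothetical, and the bins are written as integer divisions with shifts 0, 2, and 1, which reproduce the AgeA, AgeB, and AgeC categories described above for non-negative integer values.

```python
def offset_bins(value, width=3):
    """Bin indices of a non-negative integer under `width` shifted groupings.

    With width=3: shift 0 gives 0-2, 3-5, ... (AgeA); shift 1 gives
    0-1, 2-4, ... (AgeC); shift 2 gives 0, 1-3, 4-6, ... (AgeB).
    """
    return tuple((value + shift) // width for shift in range(width))

def agreements(a, b, width=3):
    """Number of shifted categorical variables on which a and b agree."""
    return sum(x == y for x, y in zip(offset_bins(a, width), offset_bins(b, width)))

# Differences of 0, 1, 2, and 3 units give 3, 2, 1, and 0 agreements.
print([agreements(30, 30 + d) for d in range(4)])
```

The same construction applies to date of the homicide once dates are coded as days since the first day of the period.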

We performed a bipartite record linkage for each of the three pairs of datafiles using the 
same blocking variables and the same comparison data as the third approach in Table 2. The
assignments were also made using nominal error levels of 0.01. For the triplets on which a 
decision could be made, the overall misclassification error was 0.0435 and the mean within 
group error was 0.0311. When trying to combine the decisions of the three independent 
procedures, however, we obtained a set of 43 record triplets on which we could not assign a 
decision. Among this set of record triplets the multiple record linkage procedure coincided 
with the hand matching procedure in 32 cases (74%). Of course the performance of the 
method for those record triplets is not as good as the general performance, since these record 
triplets are usually the ones that are more difficult to classify. However, multiple record 
linkage provides a decision along with a measure of uncertainty for that decision (namely, 
the matching probabilities), something that is not available from reconciling bipartite record 
linkages. 
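
One way in which three independent bipartite decisions can fail to combine is non-transitivity: two pairs are declared links while the third pair is not. The Python sketch below illustrates this case only; the function name and its boolean interface are assumptions made for illustration, and pairs left undeclared by a bipartite procedure, another possible source of undecidable triplets, are not modeled.

```python
def reconcile(link12, link13, link23):
    """Combine three pairwise link decisions into a matching pattern.

    Each argument is True if the corresponding pair of records was declared
    a link. Returns one of "1/2/3", "12/3", "13/2", "1/23", "123", or None
    when the pairwise decisions are non-transitive.
    """
    n_links = sum((link12, link13, link23))
    if n_links == 0:
        return "1/2/3"
    if n_links == 3:
        return "123"
    if n_links == 1:
        if link12:
            return "12/3"
        return "13/2" if link13 else "1/23"
    # Exactly two links: e.g. 1-2 and 2-3 linked but 1-3 not linked.
    return None

print(reconcile(True, False, True))
```

Any call with exactly two links returns `None`, whereas the multiple record linkage procedure always works with one of the five transitive patterns.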



8 SIMULATION STUDIES 

In practice, the performance of our method will depend on several factors: (1) the amount
of measurement error of the datafiles, (2) the number of common variables and their number
of categories/variability, (3) the sizes of the datafiles and their overlaps, (4) the dependence
structure among the recorded fields, (5) the existence of replicate records in the datafiles,
etc. Here we explore the performance of the proposed method under some simple scenarios,
emphasizing how measurement error affects our results. We used the R language to perform
our simulations (R Development Core Team, 2010).



8.1 Generating Measurement Error 

Tancredi and Liseo (2011) use a simplified version of the hit-miss model (Copas and Hilton,
1990) in order to generate measurement error. This model for categorical information on
records measured with error is given by

$$ P(Y_f^{o} = y_f^c \mid Y_f = y_f^d) = (1 - \beta_f) I(y_f^c = y_f^d) + \beta_f / C_f, \qquad (6) $$

where $Y_f^{o}$ represents the observed field $f$ and $Y_f$ represents the true value of the field $f$.
Both $Y_f^{o}$ and $Y_f$ have support $\{y_f^1, \ldots, y_f^c, \ldots, y_f^{C_f}\}$, where $C_f$ represents the number of
categories of the field $f$. Equation (6) includes a measurement error parameter $\beta_f$, which
represents the probability of measurement error for the field $f$. This model establishes that
conditioning on the unobserved true values, we can model each single record field as a mixture
of two components: the first component is concentrated on the true value while the second
one is uniformly distributed over the support of the field (Tancredi and Liseo, 2011). In
our simulation studies we do not generate error for the blocking variable. For the numerical
variables we generate error using the following model

$$ P(Y_f^{o} = y_f^c \mid Y_f = y_f^d) = (1 - \beta_f) I(y_f^c = y_f^d) + \beta_f \, 2^{-|y_f^c - y_f^d|} I(|y_f^c - y_f^d| < 3), \qquad (7) $$

which allows measurement error around the true value. For our simulation study we consider
the same value of $\beta_f$ for all the fields subject to error (so we drop the subindex $f$).
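
For the categorical hit-miss model (6), one draw can be sketched as follows. The simulations in this paper were run in R; this Python sketch is only an illustration, and the function name, its arguments, and the seed are assumptions. With probability $1 - \beta$ the true value is recorded, and otherwise a value is drawn uniformly from the whole support (so it may coincide with the true value), which gives the term $\beta / C$ in (6).

```python
import random

def hit_miss(true_value, support, beta, rng):
    """Draw one observed value under the simplified hit-miss model (6)."""
    if rng.random() < beta:
        # "Miss": uniform draw over the whole support of the field.
        return rng.choice(support)
    # "Hit": the true value is recorded without error.
    return true_value

rng = random.Random(2010)          # arbitrary seed, for reproducibility only
support = list(range(10))          # a hypothetical field with 10 categories
observed = [hit_miss(3, support, 0.10, rng) for _ in range(5)]
print(observed)
```

Under this model the probability that the observed value differs from the true one is $\beta (1 - 1/C)$, slightly below $\beta$.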



8.2 To Block or Not to Block? 

Blocking is usually an important component of record linkage since working with the complete 
cartesian product of the datafiles is computationally inefficient. In this section we show that 
we need blocking to obtain good classification results. Thus, we may want to block even in 
the presence of adequate computational power to handle the record linkage process on the 
complete cartesian product. 
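
To make the role of blocking concrete, the following Python sketch splits the Cartesian product of three datafiles in the way described in Section 7: triplets whose blocking keys show pattern 1/2/3 are set aside as non-matches, and only the remaining triplets are passed on for classification. The function names, the record layout, and the `key` argument are hypothetical, and the sketch enumerates the full product, which is affordable only for small files.

```python
from itertools import product

def blocking_pattern(k1, k2, k3):
    """Agreement pattern of three blocking keys, in partition notation."""
    if k1 == k2 == k3:
        return "123"
    if k1 == k2:
        return "12/3"
    if k1 == k3:
        return "13/2"
    if k2 == k3:
        return "1/23"
    return "1/2/3"

def split_by_blocking(file1, file2, file3, key):
    """Return (triplets left to classify, count of triplets set to 1/2/3)."""
    to_classify, n_nonmatch = [], 0
    for triplet in product(file1, file2, file3):
        if blocking_pattern(*(key(r) for r in triplet)) == "1/2/3":
            n_nonmatch += 1
        else:
            to_classify.append(triplet)
    return to_classify, n_nonmatch

# Toy records of the form (blocking key, record id).
f1 = [("A", "x1"), ("B", "x2")]
f2 = [("A", "y1")]
f3 = [("B", "z1"), ("C", "z2")]
kept, dropped = split_by_blocking(f1, f2, f3, key=lambda r: r[0])
print(len(kept), dropped)
```

For the three Quindio files of Section 7, with 67, 62, and 33 records, the full product contains 67 x 62 x 33 = 137,082 triplets before blocking.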

We take the Census homicide data as the true population information and we generate
three equal-size datafiles subject to measurement error. We generate measurement error
according to the model (7) for date of the homicide and age of the victim. We do not generate
measurement error for sex of the victim and city of the homicide since we use these variables
for blocking. We simulate 100 triplets of datafiles and for each triplet we perform multiple
record linkage using the second option of comparison data presented in Section 7. In Figure
4 we present the performance results for three values of the measurement error parameter:
0.05, 0.10, and 0.15. We compare the results of our method without blocking (solid line) and
after blocking by gender of the victim and city of the homicide (dashed line). In panel (a)
of Figure 4 we average over all the simulations the mean within group error as a measure of
the general performance of the method (or in other words, a measure of the performance of
our method on $\bigcup_p S_p$). In panels (b) to (f) we present the average misclassification error for
each specific subset $S_p$.

We can see that, for this example, the effect of blocking is huge. In general, the error
rates are very large when we use no blocking, but they decay to values close to zero under
blocking. Note also that the larger the measurement error, the larger the error recovering the
subsets $S_{123}$ and $S_{1/2/3}$, which indicates that measurement error causes true triple links to
be missed and false links to be created.
Figure 4: Measures of misclassification error for non-blocking (solid line) and blocking
(dashed line) scenarios.



8.3 Number of Blocks and Low-Quality Fields

In certain applications there are different blocking options and the possibility to include
low-quality fields in the linkage process. In this section we explore these scenarios. We generate
three databases containing five independent common fields across the different scenarios.
These first five fields contain 3, 5, 10, 10, and 15 categories, respectively, and each category is
generated with equal probability. We also use one additional independent blocking variable
in order to check the performance of the method under blocking. We consider three different
blocking scenarios which correspond to 5, 10, and 15 categories of the blocking variable, where
the categories are generated with equal probability. For all the simulation scenarios, the sizes
of the databases and their overlaps are the same as in the Colombian homicide data.

For one of the fields with 10 categories, we use $\beta = 0.7$ in order to simulate a scenario
where a common variable is available, but it is known that its quality is low. We keep $\beta = 0.7$
for this variable across three different measurement error scenarios for the remaining
four fields. These three scenarios correspond to three different values of $\beta$: 0.05, 0.10, and
0.15, and in each scenario the same $\beta$ is used to generate error for the remaining four fields.
Given the three true databases, we generate 100 triplets of observed databases using the
hit-miss model (6). For each triplet of databases we performed six implementations of the
proposed methodology for multiple record linkage. The six implementations correspond to
the combination of including/excluding the low-quality field and the three blocking options.
We made the final assignments using nominal error levels $\mu_p = 0.01$ for all $p$.

To evaluate the performance of the method in terms of recovering the classes $S_p$, we
report the misclassification error rate for each class $S_p$ and the mean within group error rate
(Qiao and Liu, 2009) for the triplets that were assigned to a certain group. The mean within
group error rate is more meaningful than the overall misclassification error for record linkage
since the groups $S_p$ are extremely unbalanced, e.g., the subset $S_{1/2/3}$ is massive whereas the
subset $S_{123}$ is extremely small. We present the results in Figure 5, where panel (a) shows
the average over all the simulations of the mean within group error (MWGE) and panels
(b) to (f) show the average misclassification error for each class $S_p$. All the panels show the
performance measures as a function of the measurement error parameter. The solid, dashed,
and dotdashed lines represent the error values for the method with 5, 10, and 15 blocks,
respectively. The grey lines represent the method including the low-quality extra field. Note
that the scale of the vertical axes is the same for panels (a) to (e), but we present panel (f)
with a different scale since the errors for the subset $S_{123}$ are significantly larger compared to
the other subsets.
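
The difference between the two summary measures can be illustrated with a short Python sketch. The function names are hypothetical, the overall error is assumed here to be the class-size-weighted average of the per-class error rates, and the class sizes in the second example are made up. The first example uses the per-class error rates of the third option in Table 2.

```python
def mean_within_group_error(per_class_error):
    """Unweighted mean of the per-class misclassification error rates."""
    return sum(per_class_error.values()) / len(per_class_error)

def overall_error(per_class_error, class_sizes):
    """Size-weighted error rate (assumed definition of the overall error)."""
    total = sum(class_sizes.values())
    return sum(per_class_error[p] * class_sizes[p] for p in per_class_error) / total

# Per-class error rates of option 3 in Table 2.
option3 = {"1/2/3": 0.0365, "12/3": 0.0079, "13/2": 0.0598,
           "1/23": 0.0082, "123": 0.0370}
print(round(mean_within_group_error(option3), 4))  # 0.0299, as reported in Table 2

# Hypothetical unbalanced case: a massive class with few errors hides a
# small class that is misclassified half of the time.
errors = {"big": 0.01, "small": 0.50}
sizes = {"big": 990, "small": 10}
print(overall_error(errors, sizes), mean_within_group_error(errors))
```

In the hypothetical case the overall error is about 0.015 while the mean within group error is 0.255, which is why the latter is preferred when the groups are very unbalanced.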

We can see that, in general, the larger the measurement error, the larger the error rates,
which is something that one would expect. We can also see that under all the scenarios,
increasing the amount of blocking decreases the error rates. In particular, note in panel (f)
that blocking has a huge impact on the reduction of the misclassification for the class $S_{123}$.
Finally, we note that for each blocking scenario, the inclusion of the low-quality extra field
increases the error rates.
Figure 5: Measures of misclassification error for different number of blocks and
inclusion/exclusion of low-quality fields. The blocking scenarios are 5 blocks (solid
line), 10 blocks (dashed line), and 15 blocks (dotdashed line). The grey lines represent
the performance of the method including the low-quality extra field. Note the
different scale of panel (f).

9 CONCLUSIONS AND FUTURE WORK 

Our method provides a framework for the integration of more than two datafiles without
common identifiers. The ideas are an extension of the theory proposed by Fellegi and Sunter
(1969) and its more modern implementations, as in Winkler (1988) and Jaro (1989). The
method solves the problem of obtaining non-transitive decisions, as is common when
reconciling bipartite record linkages. Our method also provides matching probabilities for the
record $K$-tuples, something that is not available from reconciling bipartite record linkages,
but that is necessary in order to incorporate the uncertainty of the linkage procedure in
posterior analysis such as regression (Lahiri and Larsen, 2005). We proposed a decision rule which
is optimal under the availability of the true matching probabilities. In practice, however,
the optimality of the decision rule hinges on the availability of well-calibrated probability
models, i.e., good estimates of the probability of a particular $K$-tuple belonging to the
subsets $S_p$. Thus, we need to consider models that go beyond the present one and that capture
dependencies between fields (e.g., see Larsen and Rubin, 2001). Nevertheless, even using a
naive model, our method performed well both in the integration of the Colombian homicide
datafiles and in our simulations.

We believe our method holds promise in the context of record linkage for census coverage
measurement evaluation. For example, the U.S. Census Bureau has for several decades done a
two-sample linkage between the actual enumeration and data from a post-enumeration survey
based on data from a nationwide sample of census blocks (Hogan, 1992, 1993). Additional
sources of data that could be used to improve coverage estimation include the American
Community Survey and various administrative record files. Incorporation of them would
require linkage of $K > 3$ datafiles, using methods that could build upon the work described
here that would take into account multiple sampling designs and census adjustments such as
imputations and erroneous enumerations.



A APPENDIX: PROOFS 

In the proofs presented below we use the notation introduced in Section 4, where for instance,
$P(S_p)$ means $P(r_j \in S_p)$, and so on.

Proof of Proposition 1. The ordering of $w_p^j$ is the same as the ordering of $\mathrm{logit}[P(S_p \mid \gamma^j)]$
since

$$ w_p^j = \log \frac{P(\gamma^j \mid S_p)}{P(\gamma^j \mid S_p^c)} = \log \frac{P(S_p \mid \gamma^j) / P(S_p)}{P(S_p^c \mid \gamma^j) / P(S_p^c)} = \mathrm{logit}[P(S_p \mid \gamma^j)] - \mathrm{logit}[P(S_p)], $$

and the last term does not depend on $\gamma^j$.

Finally, the logit function is a monotonic increasing function of its argument, thus the
ordering of $\mathrm{logit}[P(S_p \mid \gamma^j)]$ is the same as the ordering of $P(S_p \mid \gamma^j)$.

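Proof of Theorem 1. Optimality of the Generalized Fellegi-Sunter Linkage Rule.

Let us define the set of possible decisions for a record $K$-tuple. Let us call $D_p$ the decision
of assigning a record $K$-tuple to the subset $S_p$ and $D_u$ the decision to keep the record $K$-tuple
undeclared. Thus, a decision function $d$ is a $(B_K + 1)$-tuple given by

$$ d(\gamma^j) = \big( P(D_{1/2/\cdots/K} \mid \gamma^j), \ldots, P(D_p \mid \gamma^j), \ldots, P(D_{12 \cdots K} \mid \gamma^j), P(D_u \mid \gamma^j) \big), $$

where

$$ P(D_u \mid \gamma^j) + P(D_p \mid \gamma^j) = 1. $$

The proposed decision rule $L_0$ is such that

$$ P_0(D_p \mid \gamma^{(j)_p}) = 1, \quad \text{if } (j)_p \leq (j')_p - 1; $$
$$ P_0(D_u \mid \gamma^{(j)_p}) = 1, \quad \text{if } (j)_p \geq (j')_p, $$

for $(j)_p$ in the subset of record $K$-tuples for which $P(S_p \mid \gamma^j)$ is maximum and $(j')_p$ is obtained
as in the statement of Theorem 1. This decision rule minimizes the probability of assigning
each record $K$-tuple to the wrong subset $S_p$ or keeping it undeclared, subject to a set of
admissible error levels $\mu_p = P(D_p \mid S_p^c)$, $p \in \mathcal{P}_K$. For decision rules $L_0$ and $L_1$

$$ \mu_p = P(D_p \mid S_p^c) = \sum_{(j)_p} P_0(D_p \mid \gamma^{(j)_p}) P(\gamma^{(j)_p} \mid S_p^c) = \sum_{(j)_p} P_1(D_p \mid \gamma^{(j)_p}) P(\gamma^{(j)_p} \mid S_p^c). $$

From the construction of $L_0$ we obtain

$$ \sum_{(j)_p \leq (j')_p - 1} P(\gamma^{(j)_p} \mid S_p^c) = \sum_{(j)_p} P_1(D_p \mid \gamma^{(j)_p}) P(\gamma^{(j)_p} \mid S_p^c) $$

or

$$ \sum_{(j)_p \leq (j')_p - 1} P(\gamma^{(j)_p} \mid S_p^c) \left[ 1 - P_1(D_p \mid \gamma^{(j)_p}) \right] = \sum_{(j)_p \geq (j')_p} P_1(D_p \mid \gamma^{(j)_p}) P(\gamma^{(j)_p} \mid S_p^c). \qquad (A.1) $$

Since

$$ P(\gamma^{(i)_p} \mid S_p) P(\gamma^{(j)_p} \mid S_p^c) \leq P(\gamma^{(j)_p} \mid S_p) P(\gamma^{(i)_p} \mid S_p^c) $$

whenever $(j)_p < (i)_p$ we have

$$ \Big[ \sum_{(j)_p \leq (j')_p - 1} P(\gamma^{(j)_p} \mid S_p^c) \left[ 1 - P_1(D_p \mid \gamma^{(j)_p}) \right] \Big] \Big[ \sum_{(i)_p \geq (j')_p} P(\gamma^{(i)_p} \mid S_p) P_1(D_p \mid \gamma^{(i)_p}) \Big] $$
$$ \leq \Big[ \sum_{(i)_p \geq (j')_p} P(\gamma^{(i)_p} \mid S_p^c) P_1(D_p \mid \gamma^{(i)_p}) \Big] \Big[ \sum_{(j)_p \leq (j')_p - 1} P(\gamma^{(j)_p} \mid S_p) \left[ 1 - P_1(D_p \mid \gamma^{(j)_p}) \right] \Big]; \qquad (A.2) $$

dividing (A.2) by (A.1) we obtain

$$ \sum_{(j)_p \geq (j')_p} P(\gamma^{(j)_p} \mid S_p) P_1(D_p \mid \gamma^{(j)_p}) \leq \sum_{(j)_p \leq (j')_p - 1} P(\gamma^{(j)_p} \mid S_p) \left[ 1 - P_1(D_p \mid \gamma^{(j)_p}) \right], $$

from which

$$ \sum_{(j)_p} P(\gamma^{(j)_p} \mid S_p) P_1(D_p \mid \gamma^{(j)_p}) \leq \sum_{(j)_p} P(\gamma^{(j)_p} \mid S_p) P_0(D_p \mid \gamma^{(j)_p}), $$

which is the same as

$$ P_1(D_p \mid S_p) \leq P_0(D_p \mid S_p), \qquad (A.3) $$

which implies

$$ P_1(D_p^c \mid S_p) \geq P_0(D_p^c \mid S_p) $$

for all $p \in \mathcal{P}_K$. Note that the probability of taking a wrong decision or not deciding can be
written as

$$ \sum_{p \in \mathcal{P}_K} P(D_p^c \cap S_p) = \sum_{p \in \mathcal{P}_K} P(D_p^c \mid S_p) P(S_p), $$

which is minimized by the generalized Fellegi-Sunter linkage rule $L_0$, as we can see using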